Comparing Keyword Extraction Techniques for WEBSOM Text Archives
Authors
Abstract
The WEBSOM methodology for building very large text archives has a very slow method for extracting meaningful unit labels. This is because the method computes the relative frequencies of all the words of all the documents associated with each unit and then compares these to the relative frequencies of all the words of all the other units of the map. Since maps may have more than 100,000 units and the archive may contain up to 7 million documents, the existing WEBSOM method is not practical. A fast alternative method is based on the distribution of weights in the weight vectors of the trained map, plus a simple manipulation of the random projection matrix used for input data compression. Comparisons made using a WEBSOM archive of the Reuters text collection reveal that a high percentage of the keywords extracted using this method match the keywords extracted for the same map units by the original WEBSOM method.

1. Building Large WEBSOM Text Archives

Self-Organizing Maps (SOMs), and most prominently the WEBSOM, have been shown to scale up to very large document collections [1-10]. However, because they are used mainly with data that are not pre-labeled, SOMs need automatic procedures for extracting keywords from archived documents if information about the document clusters is to be given to the user. Knowing the top keywords per unit allows non-uniform weights to be assigned to the different dimensions in centroid-based classification algorithms [11, 12, 13]. Central to these techniques, of course, is an effective way of knowing which dimensions (keywords) should receive more weight. Likewise, in hierarchical SOMs [2, 3, 4, 5], it is useful to allocate different weight distributions to different layers of the tree, and again it is important to know the central keywords of each unit. Furthermore, being able to explain why certain documents are grouped together is important to studies on document clustering [14]; to do this, we should be able to isolate the major keywords that characterize each unit of the map. Finally, knowing the keywords associated with the units allows the user to view the label distribution and "guess" where the interesting documents are.

Extracting keywords is not straightforward because a random projection method is employed to compress the large but sparse input term frequency vectors. Some previous work has been done on keyword extraction for SOM-based archives [4, 5, 9]. In fact, the WEBSOM methodology does include an automatic keyword extraction procedure [9], but the procedure is very slow: it computes the relative frequencies of all the words of all the documents associated with each unit and then compares these to the relative frequencies of the words of the other units of the map. Since current WEBSOM text archives have more than 100,000 units and may contain up to 7 million documents, the existing keyword extraction method is not practical.

This paper is organized as follows. Section 2 describes the process of deducing the most important keywords. The keyword deduction method is illustrated in Section 3 using a WEBSOM-based archive of the well-known Reuters text collection. Comparisons of our keyword selection technique with the original WEBSOM keyword selection method are presented in Section 4.
2. Extracting Meaningful Labels

The most critical aspect of SOM-based text archiving is the compression of the initial text dataset into a size that is manageable as far as SOM training, labeling, and archiving are concerned, without losing too much of the original information content necessary for effective text classification and archiving. First reported in Kohonen [8, 10], a random projection method can radically reduce the dimensionality of the document encodings. Given a document vector ni ∈ Rⁿ, whose elements are normalized term frequencies after feature selection, and a random m × n matrix R whose elements in each column are normally distributed, one can compute the projection xi ∈ Rᵐ of the original document vector ni onto a much lower-dimensional space, i.e., m << n, using xi = R ni. Kohonen [8, 10] reports that the similarity relations between any pair of projected vectors (xi, xj) are very good approximations of those between the original document vectors (ni, nj) as long as m is at least 100.

Let r be the number of 1s per column of the random projection matrix, m the number of dimensions of the compressed input vector, and n the original number of keywords prior to random projection. Each term is randomly mapped to r dimensions, and each dimension is in turn associated with approximately rn/m terms. In our experiments with the Reuters collection, we used m = 315, r = 5, and n = 2,920, so each dimension is shared by roughly 5 × 2,920 / 315 ≈ 46 terms.
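To make the projection step concrete, the following is a minimal sketch (ours, not from the paper) of a sparse random projection with r ones per column; NumPy, the function name build_projection_matrix, and the dummy document vector are our own assumptions.

    import numpy as np

    def build_projection_matrix(m, n, r, rng):
        # m x n matrix with exactly r ones per column: each of the n terms
        # is randomly mapped to r of the m compressed dimensions.
        RPM = np.zeros((m, n), dtype=np.int8)
        for j in range(n):
            rows = rng.choice(m, size=r, replace=False)
            RPM[rows, j] = 1
        return RPM

    # Parameters reported for the Reuters experiments.
    m, n, r = 315, 2920, 5
    rng = np.random.default_rng(0)
    RPM = build_projection_matrix(m, n, r, rng)

    # ni: a normalized term-frequency vector for one document (dummy data here).
    ni = rng.random(n)
    ni /= np.linalg.norm(ni)

    # Compressed document encoding: xi = R ni.
    xi = RPM @ ni
    print(xi.shape)                            # (315,)
    print(bool((RPM.sum(axis=0) == r).all()))  # True: every term hits exactly r dimensions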
Before we describe our keyword extraction procedure, we need to be clear about what a good keyword is. In general, we want keywords to be meaningful labels for the individual units of the map, so that a user who browses a WEBSOM-based text archive gets as good a picture as possible of the contents of the documents assigned to the individual units. We adopt the two principles used by Lagus [9] that intuitively define a meaningful label for a unit in a trained WEBSOM. A term w is a meaningful label for a document cluster C if 1) w is prominent in C compared to the other words in C; and 2) w is prominent in C compared to the other occurrences of w in the whole collection.

The distribution of the weights of every map unit, relative to the weight distributions of the other units in the map, determines where the various text documents are placed during archiving. Terms mapped to high weight values are more significant than those mapped to lower-valued weights; in other words, terms mapped to high weight values are the potential keywords for the documents associated with a given map unit. But since we used a random projection matrix, each weight component has numerous terms mapped to it, so there is no straightforward way to determine which keywords truly contribute to the high weight value of a map unit. If we study how the random projection method works, however, we can trace back the various combinations of terms that contribute to each dimension of the compressed input vector. From these combinations, we can deduce the set of truly significant keywords as follows:

1. For every dimension, compute the mean weight μ and standard deviation σ over all the map units. Weight values that exceed μ + zσ are significantly high for the given dimension. For example, weights greater than μ + zσ at z = 1.645 have 95% confidence of being significantly higher than the mean; higher z-values imply higher confidence levels.

2. Every time a certain dimension d is found to be significantly high, it is likely that only one of the rn/m terms mapped to it has truly contributed to the high weight of that unit. The rest of the terms are just "piggy-back" terms.

3. Since the random projection method randomly assigns each keyword to r different dimensions, the truly significant keywords will consistently contribute high weights to their r dimensions. If we count how many of each term's randomly projected dimensions are significantly high, the count is close to r for the truly significant keywords.

4. By sorting the different keywords in decreasing order of their accumulated weights, the truly significant keywords will be at the top of the sorted list.

5. Therefore, if we want the k most important keywords per unit, we take the top k terms of the sorted list, as in Procedure 1 and the sketch that follows it.

Procedure 1. Keyword extraction procedure (for a given map unit q)

    for d = 1 to m
        if wqd ≥ μd + z·σd
            for j = 1 to n
                if RPM[d][j] = 1
                    add 1 to tallyFreq[j]
                    add wqd to sumWeights[j]
                endif
            endfor
        endif
    endfor
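As an illustrative sketch only (not the authors' code), the following Python mirrors Procedure 1 and steps 3-5 for one map unit q; the weight matrix W, projection matrix RPM, vocabulary list, and function name extract_keywords are hypothetical inputs, and z = 1.645 follows step 1.

    import numpy as np

    def extract_keywords(W, RPM, vocab, q, z=1.645, k=10):
        # W    : (num_units, m) matrix of trained SOM weight vectors
        # RPM  : (m, n) random projection matrix with r ones per column
        # vocab: list of the n original terms, indexed like the columns of RPM
        mu = W.mean(axis=0)                 # per-dimension mean over all map units
        sigma = W.std(axis=0)               # per-dimension standard deviation
        wq = W[q]                           # weight vector of the unit of interest

        tally_freq = np.zeros(len(vocab))   # significant dimensions hit per term
        sum_weights = np.zeros(len(vocab))  # weight accumulated over those dimensions

        for d in range(W.shape[1]):
            if wq[d] >= mu[d] + z * sigma[d]:   # dimension d is significantly high
                hit = np.flatnonzero(RPM[d])    # the ~rn/m terms mapped to dimension d
                tally_freq[hit] += 1
                sum_weights[hit] += wq[d]

        # Steps 4-5: keywords with the largest accumulated weights label the unit.
        top = np.argsort(-sum_weights)[:k]
        return [(vocab[j], int(tally_freq[j]), float(sum_weights[j])) for j in top]

Filtering the returned list for tally_freq close to r, per step 3, would further discard piggy-back terms that happen to share a single significant dimension.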
Similar Resources
Document Classification and Visualisation to Support the Investigation of Suspected Fraud
This position paper reports on ongoing work where three clustering and visualisation techniques for large document collections – developed at the Joint Research Centre (JRC) – are applied to textual data to support the European Commission’s investigation on suspected fraud cases. The techniques are (a) an implementation of the neural network application WEBSOM, (b) hierarchical cluster analysis...
Word-Streams for Representing Context in Word Maps
The most prominent use of Self-Organizing Maps (SOMs) in text archiving and retrieval is the WEBSOM. In WEBSOM, a map is first used to reduce the dimensionality of the huge term frequency table by training a so-called word-category map. This word-category map is then used to convert the individual documents into their respective document signatures (i.e. histogram of words) which form the basis ...
Keyword and Keyphrase Extraction Techniques: A Literature Review
In this paper we present a survey of various techniques available in text mining for keyword and keyphrase extraction.
Keyword Extraction From Chinese Text Based On Multidimensional Weighted Features
This paper proposes to solve the problems of incomplete coverage and low accuracy in keyword extraction from Chinese text, based on intrinsic features of the Chinese language and an extraction method using multidimensional weighted eigenvalues. The method combines theoretical analysis and experimental calculation to study parts of speech, word position, word length, semantic similarit...